Methodology
Dataset
A dataset of the number of daily reported total, confirmed COVID 19 cases were downloaded from The Humanitarian Data Exchange, a service provided by the United Nations Office for the Coordination of Humanitarian Affairs. Data on USA caseload was filtered and stored.
People mobility data was downloaded from Facebook Movement Range Maps, which contains measurement of movements compared to a baseline before social distancing and lock downs. Movement data for USA was filtered and was combined with daily reported number of cases. The combined dataset was split to obtain a 75-day assessment window and a training dataset for modeling.
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 14 hours 12 minutes
H2O cluster timezone: Asia/Colombo
H2O data parsing timezone: UTC
H2O cluster version: 3.32.1.5
H2O cluster version age: 11 days
H2O cluster name: H2O_started_from_R_Supun_iqy810
H2O cluster total nodes: 1
H2O cluster total memory: 0.68 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 3.6.3 (2020-02-29)
Modeling approach
The modeling pipeline was carried out in two pathways; incorporating mobility data, and without mobility data, in R language using open-source H20 machine learning libraries, creating and evaluating 6 time-series models and ensembles at each stage.
Stage 1 – Without mobility data
Original training set consists only the date as the predictor. The date has been subjected to various transformations to generate features like time lags, days, weeks, and holidays. Summary statistics of the 6 generated models are tabulated below.
Converting to H2OFrame...
Training H2O AutoML...
Leaderboard:
[7 rows x 6 columns]
Using top model: GBM_3_AutoML_20210816_094202
The best performing model is a Gradient Boosting Machine with 61 trees with depth of 8 levels and number of leaves ranging from 11 to 33 (mean of 23.22 leaves).
parsnip model object
Fit time: 36.8s
H2O AutoML - Gbm
--------
Model: Model Details:
==============
H2ORegressionModel: gbm
Model ID: GBM_3_AutoML_20210816_094202
Model Summary:
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 56930353
RMSE: 7545.221
MAE: 3388.13
RMSLE: NaN
Mean Residual Deviance : 56930353
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 166250547
RMSE: 12893.82
MAE: 6204.571
RMSLE: NaN
Mean Residual Deviance : 166250547
Cross-Validation Metrics Summary:
The final model forecast for the 45-day assessment window and the next 6 months are as follow.
No Mobility data
[7 rows x 6 columns]
Stage II – With mobility data
In addition to transformed data regarding date and time, this model has two additional predictors; Change in Movement and Stay Put. Change in Movement looks at how much people are moving around and compares it with a baseline period that predates most social distancing measures, while Stay Put looks at the fraction of the population that appear to stay within a small area during an entire day.
The following graph shows how the Change in movement fluctuated over the time.
Variation of Stay Put is shown in the below graph
The effect of mobility on the epidemic curve can be considered to be ‘lagged’. In other words, the effect of change of people’s mobility pattern today will be reflected in the epidemic curve in the future.
Therefore we engineered lagged features from the original two mobility parameters to utilize as predictors for the time-series model. The maximum lag period was taken as one month; assuming if there is any effect of change in mobility over the caseload, it would be seen by one month from the day of mobility change.
In stage II, 6 models and ensembles were tuned and evaluated. Model metrics of them are summarized in the below table.
Converting to H2OFrame...
Training H2O AutoML...
Leaderboard:
[7 rows x 6 columns]
Using top model: StackedEnsemble_AllModels_AutoML_20210816_094319
The best performing model is a stacked ensemble of 1 Distributed Random Forest (DRF), 3 Gradient Boosting Machines (GBM) and 1 meta-learning Generalized Linear Model (GLM)
parsnip model object
Fit time: 30.4s
H2O AutoML - Stackedensemble
--------
Model: Model Details:
==============
H2ORegressionModel: stackedensemble
Model ID: StackedEnsemble_AllModels_AutoML_20210816_094319
Number of Base Models: 5
Base Models (count by algorithm type):
drf gbm glm
1 3 1
Metalearner:
Metalearner algorithm: glm
Metalearner cross-validation fold assignment:
Fold assignment scheme: AUTO
Number of folds: 5
Fold column: NULL
Metalearner hyperparameters:
H2ORegressionMetrics: stackedensemble
** Reported on training data. **
MSE: 8882627
RMSE: 2980.374
MAE: 1765.243
RMSLE: 0.0649412
Mean Residual Deviance : 8882627
H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 125620074
RMSE: 11208.04
MAE: 6046.679
RMSLE: NaN
Mean Residual Deviance : 125620074
Final model predictions for the assessment window are shown in the following graph.
With Mobility data
[7 rows x 6 columns]
The amount of mobility can be controlled for the future predictions by altering the mobility parameters in the feeding data frame. We simulated a low-mobility, stay-put situation for the next 6 months and the model’s predictions are as follows.
With min mobility
In a high mobility situation, the predictions are like this.
---
title: "COVID 19 Time Series Analysis - USA. With different mobilities"
output:
  html_notebook: default
  html_document:
    df_print: paged
  word_document: default
---
### Methodology

#### Dataset

A dataset of the number of daily reported total, confirmed COVID 19 cases were downloaded from The Humanitarian Data Exchange, a service provided by the United Nations Office for the Coordination of Humanitarian Affairs. Data on USA caseload was filtered and stored.



People mobility data was downloaded from Facebook Movement Range Maps, which contains measurement of movements compared to a baseline before social distancing and lock downs. Movement data for USA was filtered and was combined with daily reported number of cases.
The combined dataset was split to obtain a 75-day assessment window and a training dataset for modeling. 

```{r message=FALSE, warning=FALSE}
knitr::opts_chunk$set(dpi = 300, fig.path = 'fig/', fig.width = 8, fig.height = 6, echo = FALSE, include = TRUE)

library(tidymodels)
library(modeltime.h2o)
library(tidyverse)
library(timetk)
library(readxl)
library(readr)

ass_window = '75 day'
lags_durarion = 15
lags_stretch = 7
pred_horizon = '6 month'
min_mob_adb = -1
min_mob_ar = 1
max_mob_adb = 1
max_mob_ar = 0

d1 <- read_delim("data/fb1", "\t", escape_double = FALSE, trim_ws = TRUE)
d2 <- read_delim("data/fb2", "\t", escape_double = FALSE, trim_ws = TRUE)


us <- d1 %>% filter(country == 'USA') %>% rbind(d2 %>% filter(country == 'USA')) %>% group_by(ds) %>% summarise(adb = mean(all_day_bing_tiles_visited_relative_change), ar = mean(all_day_ratio_single_tile_users)) %>% rename(Date = ds)

gb <- read_csv("data/time_series_covid19_confirmed_global_narrow2.csv",  col_types = cols(Date = col_date(format = "%m/%d/%Y")))


#cv <- cv %>% left_join(lk)

##US
cv <- gb %>% filter(Country == 'US') %>% select(Date, n = Value) %>% left_join(us) %>% arrange(n) %>% mutate(lag = lag(n,1)) %>% mutate(cum = n) %>% mutate(n = cum - lag) %>% select(Date,n,adb,ar)

cv2 <- gb %>% filter(Country == 'US') %>% select(Date, n = Value)%>% arrange(n) %>% mutate(lag = lag(n,1)) %>% mutate(cum = n) %>% mutate(n = cum - lag) %>% select(Date, n)

## Lagging features
cv <- cv %>% tk_augment_lags(.,.value = c(adb, ar), .lags = seq(1,lags_durarion, by = lags_stretch))
```


```{r fig.height=6, fig.width=10, message=FALSE, warning=FALSE, echo=FALSE}
cv %>% plot_time_series(.date_var = Date, n)
#cv %>% plot_time_series(.date_var = Date, adb)
#cv %>% plot_time_series(.date_var = Date, ar)

splits <- time_series_split(cv, assess = ass_window, cumulative = TRUE)
splits2 <- time_series_split(cv2, assess = ass_window, cumulative = TRUE)

recipe_spec <- recipe(n ~ ., data = training(splits)) %>%
    step_timeseries_signature(Date)
recipe_spec2 <- recipe(n ~ ., data = training(splits2)) %>%
    step_timeseries_signature(Date)

train_tbl <- training(splits) %>% bake(prep(recipe_spec), .)
test_tbl  <- testing(splits) %>% bake(prep(recipe_spec), .)

train_tbl2 <- training(splits2) %>% bake(prep(recipe_spec), .)
test_tbl2  <- testing(splits2) %>% bake(prep(recipe_spec), .)


h2o.init(
    nthreads = -1,
    ip       = 'localhost',
    port     = 54321
)
h2o.no_progress()
model_spec <- automl_reg(mode = 'regression') %>%
    set_engine(
         engine                     = 'h2o',
         max_runtime_secs           = 100, 
         max_runtime_secs_per_model = 20,
         max_models                 = 5,
         nfolds                     = 5,
         exclude_algos              = c(),
         verbosity                  = 5,
         seed                       = 786
    ) 
```

#### Modeling approach

The modeling pipeline was carried out in two pathways; incorporating mobility data, and without mobility data, in R language using open-source H20 machine learning libraries, creating and evaluating 6 time-series models and ensembles at each stage.


Stage 1 – Without mobility data

Original training set consists only the date as the predictor. The date has been subjected to various transformations to generate features like time lags, days, weeks, and holidays. Summary statistics of the 6 generated models are tabulated below. 


```{r}
model_fitted2 <- model_spec %>%
    fit(n ~ ., data = train_tbl2)
```

The best performing model is a Gradient Boosting Machine with 61 trees with depth of 8 levels and number of leaves ranging from 11 to 33 (mean of 23.22 leaves).

```{r}
model_fitted2
```


The final model forecast for the 45-day assessment window and the next 6 months are as follow.



#### No Mobility data
```{r fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
modeltime_tbl <- modeltime_table(
    model_fitted2
) 

modeltime_tbl %>%
  modeltime_calibrate(test_tbl2) %>%
    modeltime_forecast(
        new_data    = test_tbl2,
        actual_data = cv2,
        keep_data   = TRUE
    ) %>%
    plot_modeltime_forecast(
        .interactive = TRUE
    )

data_prepared_tbl <- bind_rows(train_tbl2, test_tbl2)

future_tbl <- data_prepared_tbl %>%
    future_frame(.length_out = pred_horizon)

future_prepared_tbl <- bake(prep(recipe_spec), future_tbl)

refit_tbl <- modeltime_tbl %>%
    modeltime_refit(data_prepared_tbl)

refit_tbl %>%
    modeltime_forecast(
        new_data    = future_prepared_tbl,
        actual_data = data_prepared_tbl,
        keep_data   = TRUE
    ) %>%
    plot_modeltime_forecast(
        .interactive = TRUE,.conf_interval_show = TRUE
    )
```





#### Stage II – With mobility data

In addition to transformed data regarding date and time, this model has two additional predictors; Change in Movement and Stay Put. Change in Movement looks at how much people are moving around and compares it with a baseline period that predates most social distancing measures, while Stay Put looks at the fraction of the population that appear to stay within a small area during an entire day.


The following graph shows how the Change in movement fluctuated over the time. 

```{r}
cv %>% plot_time_series(.date_var = Date, adb)
```

Variation of Stay Put is shown in the below graph

```{r}
cv %>% plot_time_series(.date_var = Date, ar)
```

The effect of mobility on the epidemic curve can be considered to be ‘lagged’. In other words, the effect of change of people’s mobility pattern today will be reflected in the epidemic curve in the future.

Therefore we engineered lagged features from the original two mobility parameters to utilize as predictors for the time-series model. The maximum lag period was taken as one month; assuming if there is any effect of change in mobility over the caseload, it would be seen by one month from the day of mobility change.

In stage II, 6 models and ensembles were tuned and evaluated. Model metrics of them are summarized in the below table.

```{r}
model_fitted <- model_spec %>%
    fit(n ~ ., data = train_tbl)
```

The best performing model is a stacked ensemble of 1 Distributed Random Forest (DRF), 3 Gradient Boosting Machines (GBM) and 1 meta-learning Generalized Linear Model (GLM)

```{r}
model_fitted
```


Final model predictions for the assessment window are shown in the following graph.


#### With Mobility data

```{r fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
modeltime_tbl <- modeltime_table(
    model_fitted
) 

modeltime_tbl %>%
  modeltime_calibrate(test_tbl) %>%
    modeltime_forecast(
        new_data    = test_tbl,
        actual_data = cv,
        keep_data   = TRUE
    ) %>%
    plot_modeltime_forecast(
        .interactive = TRUE
    )

data_prepared_tbl <- bind_rows(train_tbl, test_tbl)

refit_tbl <- modeltime_tbl %>%
    modeltime_refit(data_prepared_tbl)
```


The amount of mobility can be controlled for the future predictions by altering the mobility parameters in the feeding data frame. We simulated a low-mobility, stay-put situation for the next 6 months and the model’s predictions are as follows.


#### With min mobility

```{r fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
future_tbl <- data_prepared_tbl %>%
    future_frame(.length_out = pred_horizon)

future_prepared_tbl <- bake(prep(recipe_spec), future_tbl) %>% mutate(adb = min_mob_adb, ar = min_mob_ar) %>% tk_augment_lags(.,.value = c(adb, ar), .lags = seq(1,lags_durarion, by = lags_stretch))


refit_tbl %>%
    modeltime_forecast(
        new_data    = future_prepared_tbl,
        actual_data = data_prepared_tbl,
        keep_data   = TRUE
    ) %>%
    plot_modeltime_forecast(
        .interactive = TRUE,.conf_interval_show = TRUE
    )
```

In a high mobility situation, the predictions are like this.

#### With max mobility

```{r fig.height=6, fig.width=10, message=FALSE, warning=FALSE}
future_prepared_tbl <- bake(prep(recipe_spec), future_tbl) %>% mutate(adb = max_mob_adb, ar = max_mob_ar) %>% tk_augment_lags(.,.value = c(adb, ar), .lags = seq(1,lags_durarion, by = lags_stretch))

refit_tbl %>%
    modeltime_forecast(
        new_data    = future_prepared_tbl,
        actual_data = data_prepared_tbl,
        keep_data   = TRUE
    ) %>%
    plot_modeltime_forecast(
        .interactive = TRUE,.conf_interval_show = TRUE
    )

```

### Comparison of the two final models; is mobility data truly informative?

The following graph compares the prediction error made by the two models; without mobility data and with mobility data.

```{r message=FALSE, warning=FALSE}
sum <- read_csv("data/sum.csv")
sum %>% pivot_longer(cols = 'RMSE':'MAE_cv') %>% ggplot(aes(x = name, y = value, col = Experiment, fill = Experiment)) + geom_col(position = 'dodge') + theme_bw() + xlab('Metric')
```

It can be clearly seen that the error for both training data and combined holdout cross-validation samples is less in the model with mobility data. In other words, the mobility parameters actually play a role in shaping the epidemic curve. It can also be seen by observing the two simulated future 6-month periods. The low-mobility stay-put situation is forecasted to have a relatively low total number of infected people than the high mobility period. 

#### What is new?

##### This experiment was not based on epidemiological disease models to describe and forecast the trajectory of the caseload, but was purely data-driven. We tried to utilize machine learning algorithms to identify potential predictors and to learn the best model parameters which minimized the error, or in other words which fit best to the observed data. 


